Machine learning–based response assessment in patients with rectal cancer after neoadjuvant chemoradiotherapy: radiomics analysis for assessing tumor regression grade using T2-weighted magnetic resonance images

Purpose This study aimed to assess tumor regression grade (TRG) in patients with rectal cancer after neoadjuvant chemoradiotherapy (NCRT) through a machine learning–based radiomics analysis using baseline T2-weighted magnetic resonance (MR) images. Materials and methods In total, 148 patients with locally advanced rectal cancer(T2-4 or N+) who underwent MR imaging at baseline and after chemoradiotherapy between January 2010 and May 2021 were included. A region of interest for each tumor mass was drawn by a radiologist on oblique axial T2-weighted images, and main features were selected using principal component analysis after dimension reduction among 116 radiomics and three clinical features. Among eight learning models that were used for prediction model development, the model showing best performance was selected. Treatment responses were classified as either good or poor based on the MR-assessed TRG (mrTRG) and pathologic TRG (pTRG). The model performance was assessed using the area under the receiver operating curve (AUROC) to classify the response group. Results Approximately 49% of the patients were in the good response (GR) group based on mrTRG (73/148) and 26.9% based on pTRG (28/104). The AUCs of clinical data, radiomics models, and combined radiomics with clinical data model for predicting mrTRG were 0.80 (95% confidence interval [CI] 0.73, 0.87), 0.74 (95% CI 0.66, 0.81), and 0.75(95% CI 0.68, 0.82), and those for predicting pTRG was 0.62 (95% CI 0.52, 0.71), 0.74 (95% CI 0.65, 0.82), and 0.79 (95% CI 0.71, 0.87). Conclusion Radiomics combined with clinical data model using baseline T2-weighted MR images demonstrated feasible diagnostic performance in predicting both MR-assessed and pathologic treatment response in patients with rectal cancer after NCRT. Supplementary information The online version contains supplementary material available at 10.1007/s00384-024-04651-6.


Introduction
The standard treatment for locally advanced rectal cancer (LARC), defined as clinically determined T3/T4 or nodepositive cancer, consists of neoadjuvant chemoradiotherapy (NCRT) or short-course preoperative radiotherapy followed by total mesorectal excision [1].Response to neoadjuvant treatment correlates with long-term outcomes, and is a crucial factor in treatment planning in patients with LARC.Approximately 15-27% of patients with LARC achieve pathologic complete response (pCR) after chemoradiotherapy (CRT), which indicates no residual viable tumor cells in the resected specimen, and this pCR is associated with improved 5-year rates of local control, distant recurrence, disease-free survival (DFS), and overall survival (OS) [2,3].However, pCR can only be confirmed by pathologic examination of surgically resected specimens.
Response after CRT can be noninvasively assessed by rectal MRI using the mrTRG scoring, which is a 5-point grading system derived from pathologic tumor regression grade (pTRG) by the Magnetic Resonance Imaging and Rectal Cancer European Equivalence Study (MERCURY) group [4].Previous studies have reported significant differences in DFS and OS between the good and poor response groups based on mrTRG [5,6].MRI-based good responders can be candidates for watch-and-wait (W&W) approach, and a previous multicenter registry study on W&W strategy revealed a 5-year disease specific survival of 94% [7].Therefore, treatment response assessment using mrTRG could be a useful substitute for pTRG in preoperative decision-making and guidance for organ preservation strategy.
Radiomics involves the extraction of mineable, highdimensional data from radiology images and has been applied to oncologic radiology to derive meaningful information on tumors from the extracted data [8].Radiomics can provide information about the entire tumor phenotype and microenvironment that is difficult for radiologists to assess visually.Recent researches have conducted radiomics analyses of pre-treatment baseline MRI for response prediction [9,10] and have mostly focused on pCR prediction [11][12][13].
In addition to pCR prediction, assessment and prediction of MRI-based treatment response would be important because of the increasing need to identify candidates for the organpreserving strategy.Radiomics analysis can assist in the non-invasive prediction of preoperative treatment response, and be used as a screening tool to select candidates for the W&W strategy.
Therefore, this study aimed to investigate radiomics analysis and develop a machine-learning-based treatment response prediction model using T2-weighted images (T2WI) of baseline rectal MRI in patients with rectal cancer to identify good responders based on both mrTRG and pTRG.6) underwent metallic stent insertion (n = 4); (7) tumor recurrence after previous treatment (n = 3).Altogether, 148 patients were included for development of mrTRG prediction model.Among them, 104 patients were included in the development of pTRG prediction model after exclusion of patients: (1) who did not receive surgery (n = 20), (2) who received surgery > 6 months after post-CRT MRI because of the watch-and-wait approach (n = 12), and (3) who had no available pathologic reports regarding tumor regression grade (TRG) (n = 12).The patient selection process is illustrated in Fig. 1.Clinical data, including pathologic tumor differentiation grade from the initial endoscopic biopsy, mean CRT duration, mean interval between CRT and post-CRT MRI, mean interval between post-CRT MRI and surgery, and serum carcinoembryonic antigen (CEA) level at diagnosis (ng/mL, within 1 month of MRI), were recorded.

Patient selection and clinical data
This study followed the Standards for Reporting Diagnostic accuracy studies (STARD) reporting guidelines [14].

Magnetic resonance (MR) image acquisition and protocol
Baseline MR images were acquired with Achieva 3T (Philips Healthcare system, 134/148; 90.5%) and MAGNETOM Vida 3T (Siemens Healthineers System, 10/148; 6.7%) using a 32-ch coil.High-spatial-resolution axial oblique T2WI, which were obtained perpendicular to the long axis of the tumor and/or area after treatment, were used for both visual assessment and radiomics analysis.Diffusion-weighted imaging (DWI) (b-value = 0.1400) with an apparent diffusion coefficient (ADC) map was used when necessary.The MR parameters at our institution are presented in Supplementary Table 1.

Radiomics analysis
Image segmentation: One radiologist (M.W. Y. with 15 years of experience in rectal imaging) manually segmented the entire tumor mass and area after treatment within the rectal wall on each slice of oblique axial T2WI of baseline and post-CRT MR exams using the MEDIP Pro software (v2.0.0.0,MEDICALIP Co. Ltd., Seoul, Korea).The radiologist was blinded to histopathologic data and treatment outcomes, except for the diagnosis of rectal cancer and post-CRT status.Region of interest (ROI) was drawn along the margin of the tumor signal not to include outer non-rectal tissue and normal rectum on baseline MR.For segmentation on post-CRT MRI, baseline MRI was used to identify initial tumor extent, and a ROI was drawn covering the entire area of tumor bed including signal intensity demonstrating fibrosis or mucin.Data processing and dimension reduction: Before training the model, data preprocessing was performed to normalize the original data with the various range of the data value.The range of data value was adjusted to the values within the range of 0 to 1 using min-max scaling.In addition, principal component analysis (PCA) was performed on the scaled data to reduce dimensionality, because relatively small study population requires the reduction in the number of selected features to improve model performance.Six case datasets combined with 116 significant features and three clinical factors were refiltered using the Python model and PCA, respectively.
Classification model building for treatment outcomes: Grid search was conducted to identify a suitable model for the extracted features and determine the hyperparameter tuning condition for optimal training in binary classification.In this process, we assessed the generalization ability of the model through a threefold cross validation.The ensemble learning algorithms comprised boosted, bagged, subspace discriminant, subspace KNN, RUSBoost trees, random forest, logistic regression, and support vector machines (SVM) in our model (Table 1).Based on the grid-search results, we selected an optimal condition for each dataset.The total dataset was randomly divided into a training dataset for model training and a test dataset for model validation in a 7:3 ratio, with an equal ratio of the two response groups between the two data sets.In our study, the training process was conducted in a Python environment, and Python libraries were utilized to evaluate the performance, such as area under the receiver operating characteristic curve (AUROC), AUC score, accuracy sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) (Fig. 2).Furthermore, we confirmed feature importance for each dataset to observe the characteristics influencing training results (Fig. 3).

MR-assessed and pathologic tumor response
Two radiologists (with 15 and 25 years of experience in rectal imaging, respectively) reviewed the baseline and post-CRT T2WI and DWI with ADC maps in consensus.The reviewers were blinded to the clinical and pathological data, except for the post-CRT status of the rectal adenocarcinoma.mrTRG was assigned according to a five-point system [16] (Supplementary Table 2).mrTRG 1-3 and mrTRG 4-5 were classified as having good and poor responses, respectively, according to previous reports [6,17].The radiologists used baseline MR images to guide the delimitation of the tumor bed, and DWI with ADC map was additionally assessed when the reviewers were uncertain regarding the presence of a viable tumor.Another reviewer (with 18 years of experience in rectal imaging) determined T stage on baseline MR T2WI according to American Joint Committee on Cancer 8th edition [18], blinded to histopathologic data and treatment outcome.Clinical T stage was classified into T2/T3a, T3b/T3c/T3d, and T4a/T4b to increase the number of cases in each category and enhance comparability between the groups during classification model building.Interobserver agreement between the two reviewers were assessed additionally after more than 1 month of wash-out period.The pathological treatment response was determined according to the reported Dworak TRG system [19] (Supplementary Table 2), based on the results of the surgical pathological analysis.pTRG 3-4 and pTRG 0-2 were classified as having good and poor responses, respectively, according to previous reports [20,21].

Statistical analyses
Categorical data are presented as percentages, and numerical data are presented as means with standard deviations or medians with interquartile ranges (IQR), as appropriate.Indeterminate or missing mrTRG data were not included in the radiomics analysis.The model performance for predicting good or poor response based on mrTRG and pTRG were assessed using an AUROC analysis, and diagnostic performance of each model was determined by sensitivity, specificity, accuracy, PPV, and NPV.For each six model for either mrTRG or pTRG, AUC results of eight learning models were compared, and the best performance for each six dataset among them was selected.Interobserver agreement were evaluated using Cohen's kappa coefficient: 0.1-0.20 slight, 0.21-0.40fair, 0.41-0.60moderate, 0.61-0.80substantial, and 0.81-0.99near perfect.Statistical analyses were performed using Python 3.9.0 and Medcalc (ver.20.111 Medcalc software Ltd., Ostend, Belgium), and significance was set at P < .05.

Patient characteristics
The mean age of included patients was 62.8 ± 11.6, and male patients comprised 67.6% (100/148) of the included patients.The median CRT duration was 42 days (IQR 38-46 days), the median interval between CRT and post-CRT MR was 47.5 days (IQR 30-56), the median interval between post-CRT MR and surgery was 12 days (IQR 6-27.25), and the median serum CEA was 4.56 ng/ml (IQR 2.52-12.98).2.

Discussion
In our study, we evaluated a machine learning-based classification model using baseline MR radiomics for both mTRG and pTRG prediction, and the radiomics model demonstrated similar best classification performance for good or poor response between mrTRG-based and pTRG-based prediction (AUC 0.74).Moreover, combined radiomics with clinical data model presented enhanced best classification performance for both pTRGbased (0.79) and mrTRG-based response prediction (0.75) compared with those of radiomics only model.Recent studies on treatment response prediction using baseline MR radiomics mostly investigated the prediction of pCR [12,[22][23][24] or pTRG [25][26][27], and favorable model performances ranging between 0.75 and 0.84 have been reported.Delli Pizza et al. have reported similar results to ours, that is, superior performance of a combined MR radiomics and clinical features model compared with clinical features or radiomics-only models (0.79 vs. 0.68 and 0.7) [26].Although the clinical data component for model development was simpler than our study (serum CEA, T stage, and pathologic tumor grade), they also used solely T2WI for radiomics analysis and demonstrated enhanced model performance for combined clinical and radiomics data as well.As serum CEA levels can be a simple and good predictor of post-CRT treatment response [28,29], we recommend including serum CEA level as a clinical factor.Several studies have investigated multiparametric MR radiomics using not only T2WI but also T1 contrast-enhanced and DWI [12,22,24].However, we used only baseline T2WI for the radiomics analysis to reduce the feature dimension and analysis burden, thereby maintaining model performance with the present study population because increased feature dimension with additional DWI images may require a larger study population.T2WIs are key images for MR tumor evaluation and more appropriate for contrast and texture analysis; therefore, we investigated T2WI-based radiomics analysis by priority.

E. F.
Research on radiomics for mTRG prediction has been limited.Van Griethuysen et al. developed a radiomics model using baseline MR for MR-assessed response prediction and reported AUCs of 0.69-0.79[30].However, they developed their own MR-assessed response criteria that comprised several morphological features and tumor-node stages.We used the mrTRG system, which was developed by the MERCURY group and is widely used for treatment response assessments worldwide, including in our institution.MR-assessed TRG is a useful surrogate marker for pathologic TRG for response assessment because it can be used without or before surgery; therefore, it can guide treatment follow-up strategies and select candidates for organ preservation.Hence, we investigated the radiomics prediction model for mrTRG and observed a similar performance to that of pTRG.Therefore, our results might broaden the field of MR radiomics analysis and contribute to preoperative and noninvasive guidance of treatment planning in patients with post-CRT rectal cancer.
We established good-response groups that were slightly different between the mrTRG and pTRG groups.For mrTRG, mrTRG 1-3 was defined as a good response, including the gray zone in the good-response group.Several studies defined a good response as mrTRG 1-3 [5,6,17], whereas others defined a good response as mrTRG 1-2 [31,32].We defined the good response group as a wider range, including mrTRG 3, to build a screening model for selecting candidate for watch-and-wait approach.This screening approach can assist radiologists and referring physicians in making decisions for patients.During the training process of radiomics analysis, we used ensemble method comprising eight kinds of models including boosted, bagged, random forest, logistic regression, RUSBoost trees, subspace KNN, subspace discriminant, and SVM for model building.The hold-out method was utilized to check split dataset with train/test of 7:3 ratio, and threefold cross-validation method was used for all data validation process.These techniques are proper to reduce the effect of limited input data and increase the validation reliability [33].
Our study has several limitations.First, this was a single-center retrospective study, which may indicate a potential selection bias and limited generalizability.Second, MR images were acquired from two different vendors (Philips and Siemens); however, both were 3T, and all images were obtained with the same slice thickness (3 mm).Furthermore, the extracted radiomic features were scaled using min-max scaling prior to training.Third, not all patients underwent surgical resection, although we aimed to evaluate the preoperative and non-invasive prediction of post-CRT response; therefore, mTRG was the final endpoint of the classification model, as well as pTRG in the subgroup that received surgery.Fourth, we did not validate an outer cohort; however, we evaluated the model performance with the test set separated from the training set during model development.
Our model performance was evaluated using the unseen test set data.Lastly, we did not consider sequences other than T2W images, such as DWI or ADC, for the radiomics analysis.We focused on T2W image analysis, which is the most important sequence for tumor evaluation on rectal MR.. Further work with additional sequences would improve model performance.
In conclusion, the machine learning-based radiomics model using baseline T2WI, especially when combined with clinical data, demonstrated favorable performance in predicting good or poor response after CRT in patients with rectal cancer.This combined clinical and radiomics data model was useful for predicting treatment response based on both mrTRG and pTRG, thereby supporting communication with patients or referring physicians, screening candidates for organ preservation, and frequent and meticulous follow-up during treatment.
This retrospective study was approved by the institutional review board of Kyung Hee University Hospital (IRB No. 2023-12-023) and waived the requirement for informed consent.Patients who underwent baseline rectal MRI between January 2010 and May 2021 were enrolled consecutively.Patients with LARC (T2-4 or N+) who underwent NCRT and MRI after CRT at our institution were included in the study.The exclusion criteria were as follows: (1) did not receive neoadjuvant chemotherapy or/and radiotherapy (n = 418); (2) did not undergo post-CRT MRI (n = 5); (3) with other cancer types (n = 5, anal squamous cell carcinoma; n = 3, mucinous cancer); (4) without a visible tumor after transanal excision (n = 2); (5) received palliative chemotherapy or palliative surgery for distant metastasis (n = 19); ( Radiomics feature extraction and creation of dataset: All radiomic features, including volume, shape, intensity, and texture, were extracted from the original image.Features were extracted for T2WI using the MEDIP Pro software based on PyRadiomics texture measures comprising 2D size and shape-based features (n = 9), 3D size and shape-based features (n = 14), histogram-based first-order statistical features (n = 18), second-order statistical features, and graylevel co-occurrence matrix features regarding the relationships between image voxels (n = 75) [15].To create dataset for machine learning model, the extracted feature values with 116 and three clinical factors composed of serum CEA, tumor differentiation grade, and cT stage were used.The datasets were constructed to six types: clinical factors only, radiomics features of baseline MR, radiomics features with serum CEA, radiomics features with cT stage, radiomics features with tumor differentiation grade, and combined radiomics features with three clinical factors.

Fig. 3
Fig. 3 Feature importance plots for six models.A Clinical data model.B Radiomics model.C Radiomics with serum CEA model.D Radiomics with cT stage model.E Radiomics with tumor differentia-

Fig. 4 Table 4
Fig. 4 Predictive performance of the machine learning models in differentiating good (mrTRG 1-3) and poor (mrTRG 4-5) response group.A Clinical data model.B Radiomics model.C Radiomics with serum CEA model.D Radiomics with cT stage model.E Radiomics with tumor differentiation grade model.F Combined radiomics with all clinical data model ◂

Fig. 5 Fig. 6 A
Fig. 5 Predictive performance of the machine learning models in differentiating good (pTRG 3-4) and poor(pTRG 0-2) response group.A Clinical data model.B Radiomics model.C Radiomics with serum CEA model.D Radiomics with cT stage model.E Radiomics with tumor differentiation grade model.F Combined radiomics with all clinical data model ◂

Fig. 7
Fig.7A 46-year-old male patient in poor response group.Serum CEA was 15.34 ng/ml, clinical T stage was T3c and moderately differentiated adenocarcinoma at baseline.mrTRG was graded as 4, and Dworak pTRG was grade 1 (minimal regression, ypT3).A Irregularshaped circumferentially encircling mass at baseline T2W oblique axial image.B postCRT MR shows decreased volume and SI of cancer mass; however, considerable remnant tumor was exist; therefore,

Table 2
Clinical data of included patientsSD standard deviation, SCC squamous cell carcinoma, CRT chemoradiation therapy, d days, CEA carcinoembryonic antigen, TRG tumor regression grade

Table 3
Performance of six clinical, radiomics, and combined models in the differentiating good and poor response groups based on mrTRG AUC area under the curve, SD-AUC standard deviation of AUC, PPV positive predictive value, NPV negative predictive value